Telecom Churn Prediction Analysis

1 Chapter 1: Introduction

Customer churn is a key business concept that determines the number of customers that stop doing business with a specific company. Telecom industries have undergone a high pace of market liberalization and growth in recent times. In industries highly competitive like this, companies who want to maintain a sustainable business will have to address the churn rate.
For this purpose, we are using a dataset of a major telecom company whose postpaid business of voice-only plans is struggling to maintain its strong foothold in the local market due to:

  • High churn rate among customers leading to a revenue decline of ~500k USD every month.
  • The decline in overall customer base (high churn rate combined with low acquisition rate), leading to a decline in total market share.

Our analysis will mainly focus on finding out what factors are affecting the churn rate.

The report is organized as follows:

  1. Description of the Data
  2. Univariate Analysis of the variables
  3. SMART Questions Analysis
  4. Correlation Analysis
  5. Conclusion

2 Chapter 2: Description of the Data

2.1 Source Data

For our analysis, we found a dataset on Kaggle(“churn_analysis_md”). Our dataset contains 51,047 observations and 58 columns. Our data is imbalanced and since it is a representation of the real world we will be using it without balancing for our initial EDA. Our data is majorly classified into 4 categories: Demographics, Service Usage, Finance, Marketing.

Here is the structure of our data:

## 'data.frame':    51047 obs. of  58 variables:
##  $ CustomerID               : int  3000002 3000010 3000014 3000022 3000026 3000030 3000038 3000042 3000046 3000050 ...
##  $ Churn                    : chr  "Yes" "Yes" "No" "No" ...
##  $ MonthlyRevenue           : num  24 17 38 82.3 17.1 ...
##  $ MonthlyMinutes           : int  219 10 8 1312 0 682 26 98 24 1056 ...
##  $ TotalRecurringCharge     : int  22 17 38 75 17 52 30 66 35 75 ...
##  $ DirectorAssistedCalls    : num  0.25 0 0 1.24 0 0.25 0.25 2.48 0 0 ...
##  $ OverageMinutes           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RoamingCalls             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercChangeMinutes        : int  -157 -4 -2 157 0 148 60 24 20 43 ...
##  $ PercChangeRevenues       : num  -19 0 0 8.1 -0.2 -3.1 4 6.8 -0.3 2.4 ...
##  $ DroppedCalls             : num  0.7 0.3 0 52 0 9 0 0 0 0 ...
##  $ BlockedCalls             : num  0.7 0 0 7.7 0 1.7 1 0.3 0 0 ...
##  $ UnansweredCalls          : num  6.3 2.7 0 76 0 13 2.3 4 1 0 ...
##  $ CustomerCareCalls        : num  0 0 0 4.3 0 0.7 0 4 0 0 ...
##  $ ThreewayCalls            : num  0 0 0 1.3 0 0 0 0 0 0 ...
##  $ ReceivedCalls            : num  97.2 0 0.4 200.3 0 ...
##  $ OutboundCalls            : num  0 0 0.3 370.3 0 ...
##  $ InboundCalls             : num  0 0 0 147 0 0 0 0 1.7 0 ...
##  $ PeakCallsInOut           : num  58 5 1.3 555.7 0 ...
##  $ OffPeakCallsInOut        : num  24 1 3.7 303.7 0 ...
##  $ DroppedBlockedCalls      : num  1.3 0.3 0 59.7 0 10.7 1 0.3 0 0 ...
##  $ CallForwardingCalls      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CallWaitingCalls         : num  0.3 0 0 22.7 0 0.7 0 0 0 0 ...
##  $ MonthsInService          : int  61 58 60 59 53 53 57 59 53 55 ...
##  $ UniqueSubs               : int  2 1 1 2 2 1 2 2 3 1 ...
##  $ ActiveSubs               : int  1 1 1 2 2 1 2 2 3 1 ...
##  $ ServiceArea              : chr  "SEAPOR503" "PITHOM412" "MILMIL414" "PITHOM412" ...
##  $ Handsets                 : int  2 2 1 9 4 3 2 3 4 9 ...
##  $ HandsetModels            : int  2 1 1 4 3 2 2 3 3 5 ...
##  $ CurrentEquipmentDays     : int  361 1504 1812 458 852 231 601 464 544 388 ...
##  $ AgeHH1                   : int  62 40 26 30 46 28 52 46 36 46 ...
##  $ AgeHH2                   : int  0 42 26 0 54 0 58 46 34 68 ...
##  $ ChildrenInHH             : chr  "No" "Yes" "Yes" "No" ...
##  $ HandsetRefurbished       : chr  "No" "No" "No" "No" ...
##  $ HandsetWebCapable        : chr  "Yes" "No" "No" "Yes" ...
##  $ TruckOwner               : chr  "No" "No" "No" "No" ...
##  $ RVOwner                  : chr  "No" "No" "No" "No" ...
##  $ Homeownership            : chr  "Known" "Known" "Unknown" "Known" ...
##  $ BuysViaMailOrder         : chr  "Yes" "Yes" "No" "Yes" ...
##  $ RespondsToMailOffers     : chr  "Yes" "Yes" "No" "Yes" ...
##  $ OptOutMailings           : chr  "No" "No" "No" "No" ...
##  $ NonUSTravel              : chr  "No" "No" "No" "No" ...
##  $ OwnsComputer             : chr  "Yes" "Yes" "No" "No" ...
##  $ HasCreditCard            : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ RetentionCalls           : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ RetentionOffersAccepted  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NewCellphoneUser         : chr  "No" "Yes" "Yes" "Yes" ...
##  $ NotNewCellphoneUser      : chr  "No" "No" "No" "No" ...
##  $ ReferralsMadeBySubscriber: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ IncomeGroup              : int  4 5 6 6 9 1 9 6 9 5 ...
##  $ OwnsMotorcycle           : chr  "No" "No" "No" "No" ...
##  $ AdjustmentsToCreditRating: int  0 0 0 0 1 1 1 0 0 1 ...
##  $ HandsetPrice             : chr  "30" "30" "Unknown" "10" ...
##  $ MadeCallToRetentionTeam  : chr  "Yes" "No" "No" "No" ...
##  $ CreditRating             : chr  "1-Highest" "4-Medium" "3-Good" "4-Medium" ...
##  $ PrizmCode                : chr  "Suburban" "Suburban" "Town" "Other" ...
##  $ Occupation               : chr  "Professional" "Professional" "Crafts" "Other" ...
##  $ MaritalStatus            : chr  "No" "Yes" "Yes" "No" ...

Here is the how the variables been divided into different categories as mentioned above:

2.2 Preparation for EDA

  • Created columns like perc_overage_minute from existing columns for fair comparison. For instance, telcm\(perc_overage_minute <- (telcm\)OverageMinutes / telcm$MonthlyMinutes) * 100

  • Converted categorical variables like churn, occupation, credit rating into factor.

  • Created two subsets for churned and retained customers respectively for better analysis and visualizations.

  • Installed packages like plotly to make graphs look more intuitive and elegant.

3 Chapter 3: Slicing the data for an overview

3.1 Churn Rate Distribution

A brief overview of the dataset yields the following observations for Churn Rate: The percentage of churned customers are 28.8% and the percentage of the rest of the customers are 71.2%.

3.2 Monthly Minutes, Age and Occupation


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 156 rows containing non-finite values (stat_bin).
## Warning: Removed 101 row(s) containing missing values (geom_path).

The mean of monthly minutes for churned customers is 483.8274708 and mean of monthly minutes for retained customers is 542.5464828 .Mean for monthly minutes for churned customers is lesser than the retained customers. From the histogram and QQ Plot we can notice that the distribution of the monthly minutes is right skewed.

## Warning: Removed 909 rows containing non-finite values (stat_bin).

## Warning: Removed 909 rows containing non-finite values (stat_boxplot).

As we can see from the above two plots, age 0 has maximum frequency which is technically not possible. Hence,we can say that there is an error with the data collection and this will be taken care of during the modelling process.

From this distribution, we can notice that Other and Professional has more frequency wheras the occupation homemaker has the least frequency.

4 Chapter 4: SMART Questions Analysis

4.1 Does credit rating have an impact on churn rate?

4.2 At what duration is the churn rate high for the customers?

Here,we want to compare the frequency of churned customers with respect to service months of the customer. For this purpose, we are using a subset of data with only churned customers. Once we look at the histogram below, we can notice that our data is rightly skewed. It can be observed that the customers are mainly churning during the initial 10 to 25 months of the service period. Approximately after 30 months the churn rate is reducing drastically.


4.3 In what Prizm codes are the Churn Rates high?

So we have 4 prizm codes i.e, suburban, rural , town and rural We have plotted a pie chart to see the distribution of churn for the same. For this purpose we have used 2 different subsets for churned and retained. The first one is for the churned customers and we can see that The next is for retained customers There is no significant difference between the 2 graphs. But in town and rural area we can notice that the churn rate is slightly higher than retained customers.

4.4 Is occupation independent of churn ?

We want to know if occupation of the customer has any effect on Churn. Since both Occupation and Churn are categorical variables, we are performing Chi-Square test to check for any dependency between these 2 variables.

H0: Churn is independent of Occupation.
H1: Churn is dependent on Occupation.

## 
##  Pearson's Chi-squared test
## 
## data:  Occupation_Churn
## X-squared = 10.316, df = 7, p-value = 0.1714

From the test, we can notice that the p value is 0.171 which is greater than 0.05, so we will be accepting H0 i.e,the null hypothesis. Therefore, we can conclude that occupation is independent of churn.

4.5 Does the monthly revenue average differ for different occupations?

5 Chapter 5: Correlation Analysis

6 Chapter 6: Conclusion

By performing the analysis, we gained insights on how our data is structured and what factors are affecting churn in the Telecom industry. After conducting EDA on our variables, we were able to choose better SMART questions and were able to assess what was causing customers to churn.

We gained the following results from our analysis:

  • Churned customers use less monthly minutes than the retained customers for similar monthly revenue.

  • Few customers have incorrect age as 0 which could be null.

  • Customers are churning irrespective of their credit rating and the proportion of churned customer is larger on high credit score.

  • Prizm Codes aren’t playing a major difference in Churn rate. However, there is a slight increase in percentage of churned customers compared to other customers in Town and Rural prizm codes.

  • Churn rate is high in the initial 10 to 25 months of the service period.

  • Occupation is independent of churn.

Finally, we have performed some correlation analysis on few variables. - Monthly overage minutes and monthly revenue have moderate positive correlation. - There is need of predictive modelling to predict customer churn well in advance.